setwd("~/MOOCs/Udacity/R Data Science")
pf <- read.csv("pseudo_facebook.tsv", sep='\t')
require(ggplot2)
## Loading required package: ggplot2
Notes:
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point()
Response: The highest concentration of friend counts is among people under the age of thirty. There is also a surprisingly large number of Facebook users over the age of 90 with high friend counts – probably more than exist in reality. There also seems to be a cluster of high friend counts at certain ages between 60 and 90.
Notes:
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point() +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes: Overplotting makes it difficult to tell how many points fall in each region. Adding `alpha = 1/20` to our geom_point layer means it takes 20 overlapping points to be the equivalent of one of the fully opaque black dots in the previous plot.
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha=1/20) +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes: We can also add jitter to our plot, because the points are lining up in perfect vertical columns at integer ages, which is not a true reflection of age (age is continuous, but it was recorded in whole years). If you look at the zoomed plot, those perfect columns seem intuitively wrong. Jitter adds a small amount of noise to each age to give a clearer picture of age versus friend count.
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha=1/20) +
xlim(13,90)
## Warning: Removed 5183 rows containing missing values (geom_point).
Response: I notice that there is an odd spike out near age 70. I also see that the majority of people with high friend counts are under age 30. But now the friend counts are not nearly as high as they were in the previous plot.
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha=1/20) +
xlim(13,90) +
coord_trans(x = "sqrt")
## Warning: Removed 5183 rows containing missing values (geom_point).
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha=1/20) +
xlim(13,90) +
coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes: `coord_trans()` changes the shape of the plot: it transforms the coordinate system after any summary statistics are computed, rather than transforming the data itself.
Notes: Here we add a jitter quality to geom_point through its position argument instead of using geom_jitter, which lets us control the jitter direction while layering a sqrt coordinate transformation on y. We have to use `position = position_jitter(h = 0)` (horizontal jitter only) because friendships_initiated includes counts of zero: if vertical noise pushed such a point below zero, taking the square root of a negative number would yield NaN in R (it does not return an imaginary number), producing warnings and dropped points.
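A quick standalone check of that claim about square roots of negatives in R:

```r
# sqrt() of a negative number in R returns NaN (with a warning),
# not a complex value -- which is why jittered counts must not go below 0.
x <- suppressWarnings(sqrt(-0.5))
is.nan(x)  # TRUE
```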
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha=1/10, position = position_jitter(h=0)) +
xlim(13,90) +
coord_trans(y = "sqrt")
## Warning: Removed 5181 rows containing missing values (geom_point).
Notes:
Notes:
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
pf.fc_by_age <- pf %>%
group_by(age)%>%
summarize(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age)
head(pf.fc_by_age, 20)
## Source: local data frame [20 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## 11 23 202.8426 93.0 4404
## 12 24 185.7121 92.0 2827
## 13 25 131.0211 62.0 3641
## 14 26 144.0082 75.0 2815
## 15 27 134.1473 72.0 2240
## 16 28 125.8354 66.0 2364
## 17 29 120.8182 66.0 1936
## 18 30 115.2080 67.5 1716
## 19 31 118.4599 63.0 1694
## 20 32 114.2800 63.0 1443
Create your plot!
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
Notes:
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha=1/10,
position = position_jitter(h=0),
color = "orange") +
geom_line(stat = "summary", fun.y = mean) +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9), linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = median, color = "blue") +
coord_cartesian(xlim = c(13,70), ylim = c(0,1000))
Response: We can see that the mean is consistently higher than the median – the data is skewed upward by high outliers. We can also get a better sense of the probable range of friend counts for each age, and how tight the distribution is.
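The skewing effect of large outliers on the mean can be seen with a small synthetic example (hypothetical friend counts, not taken from the dataset):

```r
# 95 modest users plus 5 heavy users: the mean is pulled far above the median.
friends <- c(rep(50, 95), rep(4000, 5))
mean(friends)    # 247.5 -- dragged up by the five outliers
median(friends)  # 50    -- unaffected
```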
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
Notes:
cor.test(pf$age, pf$friend_count, method = "pearson")$estimate
## cor
## -0.02740737
#with(pf, cor.test(age, friend_count, method = "pearson"))$estimate
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
Notes:
with(subset(pf, age <= 70), cor.test(age, friend_count))$estimate
## cor
## -0.1717245
Notes: The summary statistic tells a story of a negative relationship between age and friend count. But it does not imply causation. To make causal claims, we would want to run an experiment and use inferential statistics rather than relying on observational, descriptive statistics.
Notes:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point(alpha = 1/50,
position = position_jitter(h=0),
color = "green") +
geom_line(stat = "summary", fun.y = mean) +
geom_line(stat = "summary", fun.y = median, color = "red") +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1), color = "blue", linetype = 2) +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9), color = "blue", linetype = 2) +
coord_cartesian(xlim = c(0,200), ylim = c(0,200))
Notes:
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = "lm", color = "red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
cor.test(pf$www_likes_received, pf$likes_received)$estimate
## cor
## 0.9479902
Response: Strong correlations like this can pop up when one variable is actually a subset of the other, and that is what happened here. Likes received on the web (desktop) are a component of total likes received, so the two are highly related by nature. Because the variables are not independent, the correlation does not tell us what is driving the phenomenon; recognizing such relationships helps us decide which variables not to throw into an analysis together.
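The part-vs-total effect can be demonstrated with made-up data (values here are synthetic, not from pseudo_facebook): when a total is the sum of two independent components, each component correlates strongly with the total even though the components themselves are unrelated.

```r
# Synthetic stand-ins: total = www + mobile, with the two parts independent.
set.seed(1)
www    <- rpois(1000, lambda = 20)  # stand-in for www_likes_received
mobile <- rpois(1000, lambda = 20)  # independent component
total  <- www + mobile              # stand-in for likes_received
cor(www, total)        # around 0.7 -- inflated by the shared component
cor(www, total - www)  # near 0 -- the overlap removed
```

Subtracting the shared component (as in `total - www`) is one way to check whether two count variables are really telling you anything beyond their overlap.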
Notes:
Notes:
#install.packages('alr3')
#library(alr3)
Create your plot!
#data("Mitchell")
#?Mitchell
#write.csv(Mitchell, "Mitchell.csv")
Mitchell <- read.csv("Mitchell.csv")
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point()
I am going to guess that the correlation will be approximately zero, because the correlation test only measures a linear relationship. Temperature data by month is cyclical, and high temperatures should cancel out low temperatures.
cor.test(Mitchell$Month, Mitchell$Temp)$estimate
## cor
## 0.05747063
Notes:
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point() +
scale_x_continuous(breaks = seq(0,12*17,12))
What do you notice? Response: There is a cyclical pattern in the data (like a sine or cosine curve).
Watch the solution video and check out the Instructor Notes! Notes: I was right.
Notes:
pf$age_with_months <- (pf$age + ((12 - pf$dob_month)/12))
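A worked check of that formula with hypothetical values (not from the dataset): the expression treats a user's listed age as complete years, then adds the months elapsed since the birthday month, assuming the data is as of December.

```r
# A user listed as age 30, born in March (dob_month = 3), is 30 years
# plus the 9 months since their last birthday: 30 + 9/12 = 30.75.
age <- 30
dob_month <- 3
age + (12 - dob_month) / 12  # 30.75
```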
Programming Assignment
pf.fc_by_age_months <- pf %>%
group_by(age_with_months) %>%
summarize(friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()) %>%
arrange(age_with_months)
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
Notes: We got two plots: one with age in years and one with age in months. The resolution is different. With month-sized bins we have less data to estimate each conditional mean, so the estimates are noisier.
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line() +
geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line() +
geom_smooth()
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
geom_line(stat = "summary", fun.y = mean)
library(gridExtra)
grid.arrange(p2, p1, p3, ncol = 1)
Notes: Sometimes you don’t have to choose! We can explore the different versions. A newer version of a plot isn’t necessarily better. When we share work with a larger audience, though, one or two polished visualizations can be more powerful than a large portfolio of plots.
Reflection: I learned how to generate scatter plots in R. I learned how to jitter points and use alpha to get a better look at the density of the data. I learned that correlation is a useful tool, but it does not imply causation, nor does it capture all the finer details of what might be happening in a plot. I learned how to zoom into a plot without clipping off data. I learned how to make line graphs, change colors, and plot quantile bands. I learned how to build summary data sets at finer resolution for the different groups we might investigate.
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!